-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use ObjectStoreLocationProvider
by default
#1509
Use ObjectStoreLocationProvider
by default
#1509
Conversation
@@ -297,8 +297,8 @@ def test_object_storage_location_provider_excludes_partition_path( | |||
tbl = _create_table( | |||
session_catalog=session_catalog, | |||
identifier=f"default.arrow_table_v{format_version}_with_null_partitioned_on_col_{part_col}", | |||
# write.object-storage.partitioned-paths defaults to True | |||
properties={"format-version": str(format_version), TableProperties.OBJECT_STORE_ENABLED: True}, | |||
# Both write.object-storage.enabled and write.object-storage.partitioned-paths default to True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just changing this one integration test to test that it's now the default; I've kept test_writes::test_object_storage_data_files
unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wydt about adding assert statements as well? comments can become stale over time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, done
Want to strongly highlight for folks reviewing (@kevinjqliu + @Fokko?) that I've not changed So do we want to do this? My thinking was: after reading through apache/iceberg#11112 it seems there was a strong case for supporting partition values in paths. When I thought about that, I realised that users might be relying on storage layout and files constituting partitions lying in the same directory - maybe just so that they can inspect storage to see how their files are actually laid out at a given point in time, especially in the context of partition evolution. On the other hand, maybe we should make this change to discourage reliance on file storage layout. (It would also bring object storage performance improvements by default which is a plus) |
The main downside of this PR is that file paths and their directory structures are more obscured by default with a bunch of hash directories inserted in the middle. But IMHO this is a reasonable tradeoff (and doesn't feel as dramatic as excluding partition information from file paths entirely). It won't change existing data file paths and they'll still work fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
In terms of the default value for WRITE_OBJECT_STORE_PARTITIONED_PATHS
, here are the four different scenarios. (source)
WRITE_OBJECT_STORE_PARTITIONED_PATHS=True
for Partitioned Table:f"{partition_key.to_path()}/{data_file_name}"
WRITE_OBJECT_STORE_PARTITIONED_PATHS=True
for NonPartitioned Table:f"{prefix}/{hashed_path}/{data_file_name}"
WRITE_OBJECT_STORE_PARTITIONED_PATHS=False
for Partitioned Table:f"{prefix}/{hashed_path}-{data_file_name}"
WRITE_OBJECT_STORE_PARTITIONED_PATHS=False
for NonPartitioned Table:f"{prefix}/{hashed_path}-{data_file_name}"
Option 1 aligns with the current SimpleLocationProvider
implementation for Partitioned Table. I think there is some benefits of having the partition_key.to_path()
as the default since it's easier to reason about than a bunch of hashes. Furthermore, I look at ObjectStoreLocationProvider
as S3/Object Store optimization and should only be turned on in specific scenarios.
My preference would be to keep WRITE_OBJECT_STORE_PARTITIONED_PATHS=True
to satisfy option 1.
pyiceberg/table/__init__.py
Outdated
@@ -190,7 +190,7 @@ class TableProperties: | |||
WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl" | |||
|
|||
OBJECT_STORE_ENABLED = "write.object-storage.enabled" | |||
OBJECT_STORE_ENABLED_DEFAULT = False | |||
OBJECT_STORE_ENABLED_DEFAULT = True # Differs from the Java implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: instead of the comment here, let's add it to the docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -297,8 +297,8 @@ def test_object_storage_location_provider_excludes_partition_path( | |||
tbl = _create_table( | |||
session_catalog=session_catalog, | |||
identifier=f"default.arrow_table_v{format_version}_with_null_partitioned_on_col_{part_col}", | |||
# write.object-storage.partitioned-paths defaults to True | |||
properties={"format-version": str(format_version), TableProperties.OBJECT_STORE_ENABLED: True}, | |||
# Both write.object-storage.enabled and write.object-storage.partitioned-paths default to True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wydt about adding assert statements as well? comments can become stale over time
Small correction @kevinjqliu - the function recurses on this line with this value being the new file name - it doesn't return it. So, hashes are still prepended to the file name after the prefix, and it's not the same as when To clarify, this isn't a bug. The I'm realising maybe we should add a unit test for this case (Case 1) even though the Java implementation doesn't have one. We do have https://github.com/apache/iceberg-python/blob/main/tests/table/test_locations.py#L85-L86 but it doesn't test that entropies were injected for object storage. I've put up #1511. Let me know if your thoughts regarding a potential |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! We'd probably want to document this but i see #1510 is opened to track
Thanks for the quick follow up @smaheshwar-pltr, and thanks for the review @kevinjqliu 🙌 |
Following the discussion in #1452 (comment), this PR proposes diverging from the historical, Java-side preference of disabling object storage location provision by default, and proposes setting the
OBJECT_STORE_ENABLED_DEFAULT
table property default toTrue
in PyIceberg.